Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.
Factor names with special characters, like -, can cause issues, so cleaning them may prove helpful.
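As an illustrative sketch (the frame and column names here are hypothetical stand-ins), the special characters can be swapped for underscores:

```python
import pandas as pd

# Hypothetical frame whose column names contain '-'
data = pd.DataFrame({"capital-gain": [0, 5000], "capital-loss": [0, 100]})

# Replace the special character so names are valid Python identifiers
data.columns = data.columns.str.replace("-", "_", regex=False)

print(list(data.columns))  # ['capital_gain', 'capital_loss']
```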
Working with categorical variables often involves transforming strings to numeric values: frequently 0 or 1 for binomial factors, and a mapping $\{x_0, x_1, \dots, x_n\} \mapsto \{0, 1, \dots, n\}$ for multinomial factors.
These values may be ordinal (i.e. values whose relationships can be compared as a ranking, e.g. worst, better, best) or nominal (i.e. values that indicate a state, e.g. blue, green, yellow).
For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data, the training features, or independent variables ($X$).
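A minimal sketch of this separation, using a hypothetical toy frame in place of the census data:

```python
import pandas as pd

# Hypothetical stand-in for the census data; 'income' is the label
data = pd.DataFrame({
    "age": [39, 50, 38],
    "education_num": [13, 13, 9],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Dependent variable (Y): the label column
income = data["income"]
# Independent variables (X): everything else
features = data.drop("income", axis=1)

print(features.shape, income.shape)  # (3, 2) (3,)
```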
The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).
To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.
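A sketch of the transform (the series here is hypothetical; note that since capital_gain and capital_loss contain zeros, where $\ln(x)$ is undefined, the shifted variant $\ln(x + 1)$ is the common practical choice):

```python
import numpy as np
import pandas as pd

# Hypothetical positively skewed feature with many zeros
x = pd.Series([0, 0, 0, 1500, 99999])

# log1p computes ln(x + 1), which is defined at zero
x_log = np.log1p(x)

# The transform reduces the skewness of the distribution
print(x.skew(), x_log.skew())
```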
Why does this matter: The extreme points may affect the performance of the predictive model.
Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.
Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables. Linear regression, for example, assumes a zero conditional mean of the residuals, $E\left(u \mid x\right) = 0$ where $u = Y - \hat{Y}$, along with homoskedasticity of the residuals given the independent variables. In this analysis, the dependent variable is categorical (i.e. discrete or non-continuous), so linear regression is not an appropriate model in any case.
Originally, the influence of capital_loss on income was statistically significant, but after the logarithmic transformation, it is not.
Here it can be seen that with a change to the skew, the confidence interval now passes through zero whereas before it did not.
Passing through zero is interpreted to mean the independent variable's influence on the dependent variable is statistically indistinguishable from zero.
| Feature | Skewness | Mean | Variance |
|---|---|---|---|
| Capital Loss | 4.516154 | 88.595418 | 163985.81018 |
| Capital Gain | 11.788611 | 1101.430344 | 56345246.60482 |
| Log Capital Loss | 4.271053 | 0.355489 | 2.54688 |
| Log Capital Gain | 3.082284 | 0.740759 | 6.08362 |
These two terms, normalization and standardization, are frequently used interchangeably, but serve two different scaling purposes.
Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).
Another method of influencing the model's predictive power is normalization of the numerical independent variables, after which each feature is treated equally in the model.
However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.
Note the output from scaling. age is no longer 39 but is instead 0.30137. This value is meaningful only in context of the rest of the data and not on its own.
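A minimal sketch of this scaling with sklearn's MinMaxScaler; the toy ages below assume age ranges from 17 to 90, the bounds under which 39 maps to $(39 - 17)/(90 - 17) \approx 0.30137$:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages spanning the assumed range [17, 90]
ages = pd.DataFrame({"age": [17, 39, 90]})

# MinMaxScaler maps each column linearly onto [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

print(scaled.round(5))  # 39 -> (39 - 17) / (90 - 17) ≈ 0.30137
```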
Earlier, I transformed some categorical values into a numeric mapping. Another, perhaps more common, way to do this is to make dummy variables from the values of those factors. Pandas has a simple function, pd.get_dummies(), that can perform this very quickly.
To note, this will create a new variable for every value a categorical variable takes:
| | someFeature | | someFeature_A | someFeature_B | someFeature_C |
|---|---|---|---|---|---|
| 0 | B | | 0 | 1 | 0 |
| 1 | C | ----> one-hot encode ----> | 0 | 0 | 1 |
| 2 | A | | 1 | 0 | 0 |
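The toy example above can be reproduced directly (the frame name `df` is hypothetical; `dtype=int` keeps 0/1 output rather than the boolean default of newer pandas versions):

```python
import pandas as pd

# Toy frame matching the table above
df = pd.DataFrame({"someFeature": ["B", "C", "A"]})

# One new 0/1 column per distinct value of the factor
dummies = pd.get_dummies(df, dtype=int)

print(dummies)
#    someFeature_A  someFeature_B  someFeature_C
# 0              0              1              0
# 1              0              0              1
# 2              1              0              0
```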
This means that $p$, the number of features, will grow, potentially by a large amount.
It is also worth noting that for modeling, it is important that one value of the factor, a "base case", be dropped from the data. The base case is redundant, i.e. it can be inferred perfectly from the other cases, and, more detrimentally to our model, keeping it leads to multicollinearity of the terms.
In some models (e.g. logistic regression, linear regression), an assumption of no multicollinearity must hold.
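A sketch of dropping the base case with pandas' drop_first parameter, which removes the first level alphabetically (here 'A'), leaving $p - 1$ dummy columns:

```python
import pandas as pd

df = pd.DataFrame({"someFeature": ["B", "C", "A"]})

# drop_first=True drops the base case, avoiding perfect multicollinearity
dummies = pd.get_dummies(df, drop_first=True, dtype=int)

print(list(dummies.columns))  # ['someFeature_B', 'someFeature_C']
```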
After transforming with one-hot-encoding, all categorical variables have been converted into numerical features. Earlier, they were normalized (i.e. scaled between 0 and 1).
Next, for training a machine learning model, it is necessary to split the data into segments. One segment, the training set, will be used for training the model, and the other, the testing set, will be used for testing it.
A common method of splitting is to segment based on proportion of data. A general 80:20 rule is typical for training:test.
sklearn has a function that works well for this, sklearn.model_selection.train_test_split. Essentially, this randomly assigns one portion of the data to a training set and the rest to a testing set.
By setting a seed, via the random_state option, we can ensure the random split is the same on every run. This is necessary for reproducibly evaluating the effectiveness of the model; otherwise, each run would train and test on the same proportional split (if we kept that static) but with different observations of the data.
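A minimal sketch of an 80:20 split with a fixed seed (the arrays standing in for $X$ and $y$ are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the features (X) and label (y)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 gives the 80:20 split; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 8 2
```

Re-running with the same random_state reproduces exactly the same partition, which is what makes model evaluations comparable across runs.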